One of the largest food companies in the world has engaged Streetbees to conduct Life Moments research to understand what people drink. Rather than relying on people's memory of what they eat, Streetbees has asked participants to log every meal they have for a full week by taking photos and telling us what they are drinking in the moment.
We’ve sent you a processed subset of the overall dataset from this drink consumption survey. This data contains each submission we captured during the survey, some demographic data for each user and a few relevant questions that were used to cluster these submissions.
This data was clustered using one of our algorithms, which returned clusters for each submission under the ‘cluster_id’ column. Using any tools you find useful, please present any analysis / insight you are able to gather from this data in order to help the client understand the profiles of these clusters.
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
sns.set_palette('magma')
import plotly.express as px
import plotly.graph_objects as go
df = pd.read_csv('clean_data_streetbees.csv')
df.head()
df.shape
Survey data was gathered over 520 days (about 1 year and 5 months), from June 2nd 2019 to November 3rd 2020, in the US.
df['created'] = pd.to_datetime(df['created'])
# first and last submission timestamps (2 June 2019 and 3 November 2020)
df['created'].min()
df['created'].max()
# survey length
df['created'].max()-df['created'].min()
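As a quick sanity check on the 520-day figure, the span between the two quoted dates can be recomputed with the standard library. This is only a sketch: the dates are hard-coded from the text rather than read from the dataset, and the time-of-day component of the real timestamps is ignored.

```python
import datetime as dt

# First and last submission dates as quoted in the text (hard-coded assumption)
start = dt.date(2019, 6, 2)
end = dt.date(2020, 11, 3)

# Subtracting two dates gives a timedelta; .days is the whole-day span
span_days = (end - start).days
print(span_days)  # 520
```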
65% of survey answers are from women, almost twice as many as from men (34%); other genders account for 1%.
The gender gap is more pronounced in some clusters than in others.
Clusters 3, 6, 1, 2 and 8 have significantly more women than men.
Clusters 5, 7, 9 and 4 are closer to an equal split.
# 65% of survey answers are from women
round(df['gender'].value_counts(normalize=True)*100)
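# % share of each gender within each cluster; naming the Series 'count' explicitly keeps this working in pandas 2.x, where value_counts(normalize=True) is named 'proportion'
gender_df = df.groupby('cluster_id')['gender'].value_counts(normalize=True).mul(100).rename('count').reset_index()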
gender_df['cluster_id'] = gender_df['cluster_id'].apply(str)
gender_df = gender_df.sort_values('count', ascending=False)
# plotly gender by clusters
data = gender_df
fig = px.bar(data,
x= 'cluster_id',
y= 'count',
title='Gender differences in clusters',
hover_data=['cluster_id'],
labels={'count':'% of the cluster', 'gender':'Gender'},
barmode='group',
color=data['gender'].astype(str),
template = 'simple_white',
color_discrete_map={
'Female':'#e78671' ,
'Male':'#bf5275' ,
'Other':'#ffd667'})
fig.show()
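To put a number on "significantly more", one option is the percentage-point gap between the female and male share within each cluster. The sketch below runs on a made-up toy frame: the rows are illustrative and not survey data, and only the column names `cluster_id` and `gender` come from the dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from the survey
toy = pd.DataFrame({
    'cluster_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'gender': ['Female', 'Female', 'Female', 'Male',
               'Female', 'Female', 'Male', 'Male'],
})

# Rows: cluster, columns: gender, values: % of the cluster
share = pd.crosstab(toy['cluster_id'], toy['gender'], normalize='index') * 100

# Percentage-point gap between female and male share, largest first
gap = (share['Female'] - share['Male']).sort_values(ascending=False)
print(gap)
```

Sorting by the gap gives a defensible ordering of the clusters instead of one read off the bar chart.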
38% of survey answers are from people aged 45 and older,
24% from people aged 25-34,
22% from people aged 35-44,
and the remaining 15% from the youngest group, 18-24 year olds.
The 45+ group is most prominent in clusters 1, 7, 2 and 5.
The 35-44 group is most prominent in clusters 4, 7 and 6.
25-34 year olds are most prominent in clusters 4 and 3.
Our youngest group, 18-24, is most prominent in clusters 8 and 9.
Interestingly, cluster 4 is the least popular for the oldest and youngest groups, but the most popular for 25-34 and 35-44 year olds.
round(df['age'].value_counts(normalize=True)*100)
age_df = df.groupby('cluster_id')['age'].value_counts(normalize=True).mul(100).rename('count').reset_index()
# convert cluster_id to string for better ordering
age_df['cluster_id'] = age_df['cluster_id'].apply(str)
# order dataframe by values
age_df = age_df.sort_values('count', ascending=False)
# plotly age by clusters
data = age_df
fig = px.bar(data,
x= 'cluster_id',
y= 'count',
title='Age differences in clusters',
hover_data=['cluster_id'],
labels={'count':'% of the cluster', 'age':'Age group'},
barmode='group',
color=data['age'].astype(str),
template = 'simple_white',
color_discrete_map={
'45+':'#bf5265',
'35-44':'#e78671',
'25-34':'#bf5275',
'18-24':'#ffd667'})
fig.show()
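Because the 45+ group is the largest overall, raw shares tend to favour it in every cluster. One way to see where a group is over-represented is an index: the group's share within a cluster divided by its share in the whole sample. This is a sketch on toy data (illustrative rows, not survey results), and the index is a suggested addition rather than part of the original analysis.

```python
import pandas as pd

# Toy data for illustration only; not taken from the survey
toy = pd.DataFrame({
    'cluster_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'age': ['45+', '45+', '45+', '18-24',
            '45+', '18-24', '18-24', '18-24'],
})

# Share of each age group within each cluster
in_cluster = pd.crosstab(toy['cluster_id'], toy['age'], normalize='index')

# Share of each age group in the whole sample
overall = toy['age'].value_counts(normalize=True)

# Index > 1 means the group is over-represented in that cluster
over_index = in_cluster.div(overall, axis='columns')
print(over_index.round(2))
```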
86% of people who logged their meals/drinks were having them alone.
Only 13% in total were with their family or partner.
The remaining few were with friends, colleagues or others.
Clusters 1, 2, 8, 6 and 9 are characterised by people who had their meal/drink alone.
Cluster 3 is made up of people who had their meal/drink with their partner, family or friends.
df['who_with'].value_counts(normalize=True)*100
with_df = df.groupby('cluster_id')['who_with'].value_counts(normalize=True).mul(100).rename('count').reset_index()
# convert cluster_id to string for better ordering
with_df['cluster_id'] = with_df['cluster_id'].apply(str)
# order dataframe by values
with_df = with_df.sort_values('count', ascending=False)
# plotly company by clusters
data = with_df
fig = px.bar(data,
x= 'cluster_id',
y= 'count',
title='With who differences in clusters',
hover_data=['cluster_id'],
labels={'count':'% of the cluster', 'who_with':'With who'},
barmode='group',
color=data['who_with'].astype(str),
template = 'simple_white',
color_discrete_map={
'Alone':'#813575',
'My partner':'#af4a77',
'My family':'#d86870',
'Colleagues':'#e78671',
'Friends':'#eeb68d',
'Other': '#f4e4b8'})
fig.show()
81% of people who answered the survey said they had their meal/drink at home.
9% were at school or work,
and 6% were on the go or outdoors.
Clusters 4 and 5 are made up only of people who had their meal/drink at school or work.
Clusters 1, 2, 3, 6, 8 and 9 are made up of people who were at home.
People on the go or outdoors are found in cluster 7, as is the 'Cafe restaurant bar hotel' group.
round(df['where'].value_counts(normalize=True)*100)
where_df = df.groupby('cluster_id')['where'].value_counts(normalize=True).mul(100).rename('count').reset_index()
# convert cluster_id to string for better ordering
where_df['cluster_id'] = where_df['cluster_id'].apply(str)
# order dataframe by values
where_df = where_df.sort_values('count', ascending=False)
# plotly where by clusters
data = where_df
fig = px.bar(data,
x= 'cluster_id',
y= 'count',
title='Where differences in clusters',
hover_data=['cluster_id'],
labels={'count':'% of the cluster', 'where':'Where'},
barmode='group',
color=data['where'].astype(str),
template = 'simple_white',
color_discrete_map={
'At my home':'#813575',
'On the go outdoors':'#af4a77',
'At school work':'#d86870',
'Cafe restaurant bar hotel':'#e78671',
"At someone else's home" :'#eeb68d',
'Other': '#f4e4b8'})
fig.show()
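The claim that a cluster consists of a single location can be checked directly by counting the distinct `where` values per cluster. This is a sketch on toy data; the rows are made up for illustration and only the column names and category labels come from the dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from the survey
toy = pd.DataFrame({
    'cluster_id': [4, 4, 4, 7, 7, 7],
    'where': ['At school work', 'At school work', 'At school work',
              'On the go outdoors', 'Cafe restaurant bar hotel', 'At my home'],
})

# Number of distinct locations reported in each cluster
n_locations = toy.groupby('cluster_id')['where'].nunique()

# Clusters made up of exactly one location
single_location = n_locations[n_locations == 1].index.tolist()
print(single_location)  # [4]
```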
Overall, about a third of people report feeling good, great or amazing.
The next biggest group is feeling just fine/neutral.
At least 6% of people said they were feeling anxious/stressed, coupled with other mixed feelings.
Most noticeably, people feeling sad/depressed are found in cluster 7.
round(df['feeling'].value_counts(normalize=True)*100).head(20)
feeling_df = df.groupby('cluster_id')['feeling'].value_counts(normalize=True).mul(100).rename('count').reset_index()
# only keep feelings whose share of the cluster is greater than 4%
feeling_df = feeling_df[feeling_df['count']>4]
# convert cluster_id to string for better ordering
feeling_df['cluster_id'] = feeling_df['cluster_id'].apply(str)
# order dataframe by values
feeling_df = feeling_df.sort_values('count', ascending=False)
# plotly feelings by clusters
data = feeling_df
fig = px.bar(data,
x= 'cluster_id',
y= 'count',
title='Feeling differences in clusters',
hover_data=['cluster_id'],
labels={'count':'% of the cluster', 'feeling':'Feeling'},
barmode='group',
color=data['feeling'],
template = 'simple_white',
color_discrete_map={
'Good great':'#813575',
'Great amazing':'#af4a77',
'Neutral fine':'#d86870',
'Sleepy tired':'#e78671',
'Anxious stressed' :'#eeb68d',
'Relaxed calm': '#f4e4b8',
'Sad depressed': '#3d1b62',
'Content upbeat': '#b49ed9'})
fig.show()
The main reasons people have their drinks are to boost their energy, simply for the taste, or out of habit.
Other popular reasons include thirst, to go with food, a healthy choice, and the drink's refreshing qualities.
Cluster 2 is predominantly made up of people who had their drink simply because they liked the taste.
Routine-loving people and those who have a drink to boost their energy are found in cluster 1.
round(df['why_this'].value_counts(normalize=True)*100).head(10)
why_df = df.groupby('cluster_id')['why_this'].value_counts(normalize=True).mul(100).rename('count').reset_index()
# only keep reasons whose share of the cluster is greater than 4%
why_df = why_df[why_df['count']>4]
# convert cluster_id to string for better ordering
why_df['cluster_id'] = why_df['cluster_id'].apply(str)
# order dataframe by values
why_df = why_df.sort_values('count', ascending=False)
# plotly why_this by clusters
data = why_df
fig = px.bar(data,
x= 'cluster_id',
y= 'count',
color_discrete_sequence= px.colors.sequential.Sunsetdark_r,
title='Why this drinks in clusters',
hover_data=['cluster_id'],
labels={'count':'% of the cluster', 'why_this':'Why this'},
barmode='group',
color=data['why_this'],
template = 'simple_white')
fig.show()
The most popular activity while having a drink is watching TV, at 23%.
It is followed by doing nothing, relaxing, browsing social media and working/studying.
Clusters 6, 8 and 9 are likely to be watching TV when having a drink.
Clusters 4 and 5 on the other hand are the ones working or studying.
Every single cluster has some people who are browsing social media.
round(df['activity'].value_counts(normalize=True)*100).head(10)
act_df = df.groupby('cluster_id')['activity'].value_counts(normalize=True).mul(100).rename('count').reset_index()
# only keep activities whose share of the cluster is greater than 4%
act_df = act_df[act_df['count']>4]
# convert cluster_id to string for better ordering
act_df['cluster_id'] = act_df['cluster_id'].apply(str)
# order dataframe by values
act_df = act_df.sort_values('count', ascending=False)
# plotly activity by clusters
data = act_df
fig = px.bar(data,
x= 'cluster_id',
y= 'count',
color_discrete_sequence= px.colors.sequential.Sunsetdark_r,
title='Other activity while having their drink in clusters',
hover_data=['cluster_id'],
labels={'count':'% of the cluster', 'activity':'Activity'},
barmode='group',
color=data['activity'],
template = 'simple_white')
fig.show()
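The per-cluster share tables above all follow the same steps (group, normalise, filter, stringify the cluster id, sort), so they could be produced by one helper. This is a minimal sketch: the function name `cluster_share` and its `min_pct` parameter are hypothetical and not part of the original notebook, and the toy frame is made up.

```python
import pandas as pd

def cluster_share(frame, column, min_pct=0.0):
    """Return the % share of each `column` value within each cluster.

    Hypothetical helper mirroring the groupby / value_counts steps used above.
    """
    out = (frame.groupby('cluster_id')[column]
                .value_counts(normalize=True)
                .mul(100)
                .rename('count')   # name the value column explicitly
                .reset_index())
    # keep only values above the threshold; copy to avoid chained-assignment warnings
    out = out[out['count'] > min_pct].copy()
    # string ids so the plot treats clusters as categories
    out['cluster_id'] = out['cluster_id'].astype(str)
    return out.sort_values('count', ascending=False)

# Toy data for illustration only; not taken from the survey
toy = pd.DataFrame({
    'cluster_id': [1, 1, 1, 1, 2, 2],
    'activity': ['Watching TV', 'Watching TV', 'Watching TV', 'Working',
                 'Working', 'Working'],
})
result = cluster_share(toy, 'activity', min_pct=30)
print(result)
```

With a helper like this, each section would reduce to one call plus the plotting code, and the threshold would be stated in one place.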
Our most popular drink, 'Hot coffee', clearly defines cluster 1, together with 'Hot tea'.
Hot tea, on the other hand, is most popular in cluster 9, with no hot coffee in sight.
Soda is drunk across all clusters.
Diet soda drinkers define cluster 5.
Clusters 1 and 9 show no water, bottled or tap, above the 5% threshold used in the chart.
Among the drinks shown, beer and cider appear only in cluster 9.
all_drinks = df.groupby('cluster_id')['drink'].value_counts(normalize=True).mul(100).rename('count').reset_index()
# convert cluster_id to string for better ordering
all_drinks['cluster_id'] = all_drinks['cluster_id'].apply(str)
# only keep drinks whose share of the cluster is greater than 5%
all_drinks = all_drinks[all_drinks['count']>5]
# order dataframe by values
all_drinks = all_drinks.sort_values('count', ascending=False)
# plotly
data = all_drinks
fig = px.bar(data, x='cluster_id', y='count',
hover_data=['cluster_id'],
labels={'drink':'Drink', 'count':'% in the cluster'},
barmode='group',
color=data['drink'].astype(str),
template = 'simple_white',
color_discrete_sequence= px.colors.sequential.Sunsetdark,
height=800)
fig.show()
People drink coffee mainly to boost energy, then because it is part of their routine and because they like the taste.
Soda is drunk for the taste and to boost energy.
Bottled and tap water are both drunk when people are feeling thirsty.
drink_why_df = pd.DataFrame(df.groupby(['drink', 'why_this']).count()).reset_index()
# choose specific columns
drink_why_df = drink_why_df[['drink', 'why_this', 'id']].rename(columns={'id':'count'})
# sort values
drink_why_df = drink_why_df.sort_values('count', ascending=False)
# only keep drink-why pairs where the pair appears more than 10 times
drink_why_df = drink_why_df[drink_why_df['count']>10]
import plotly.io as pio
# plotly drinks with reason
data = drink_why_df
fig = px.bar(data,
x= 'drink',
y= 'count',
color_discrete_sequence= px.colors.sequential.Sunsetdark_r,
title='Drinks by reason (all clusters)',
hover_data=['why_this'],
labels={'count':'Count', 'why_this':'Reason', 'drink':'Drink'},
barmode='group',
color=data['why_this'],
template = 'simple_white')
fig.show()
pio.write_html(fig, file='index.html', auto_open=True)
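Raw counts favour the most common drinks. To read off the leading reason for each drink regardless of how often the drink appears, the pairs can be ranked within each drink. This is a sketch on toy data; the rows are made up and only the column names `drink` and `why_this` come from the dataset.

```python
import pandas as pd

# Toy data for illustration only; not taken from the survey
toy = pd.DataFrame({
    'drink': ['Hot coffee', 'Hot coffee', 'Hot coffee', 'Tap water', 'Tap water'],
    'why_this': ['Boost energy', 'Boost energy', 'Habit', 'Thirst', 'Thirst'],
})

# Count each (drink, reason) pair
pairs = toy.groupby(['drink', 'why_this']).size().rename('count').reset_index()

# Sort by count, then keep the first (most frequent) reason for each drink
top_reason = (pairs.sort_values('count', ascending=False)
                   .drop_duplicates('drink')
                   .set_index('drink')['why_this'])
print(top_reason)
```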
It seems people have most of their drinks between 12 pm and 6 pm.
# plot the hour of day when people drink
plt.figure(figsize=(12,6))
sns.set_style('white')
q = df['hour'].value_counts(normalize=True).sort_index()*100
y = q.values
x = q.index
ax = sns.barplot(x=x, y=y, palette='magma')
sns.despine(top=True, right=True)
# label each bar
for p in ax.patches:
    height = p.get_height() # get the height of each bar
    # adding text to each bar
    ax.text(x = p.get_x()+(p.get_width()/2), # x-coordinate of the data label, centred on the bar
            y = height+0.5, # y-coordinate of the data label, padded 0.5 above the bar
            s = '{:.0f}%'.format(height), # data label, formatted to ignore decimals
            ha = 'center', # sets horizontal alignment (ha) to center
            fontsize=13)
plt.xlabel('Hour of the day', fontsize=14)
plt.ylabel('% of when people drink', fontsize=14)
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.title('When do people drink?', fontsize=14);
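The share of submissions that fall in the afternoon window can be computed directly instead of being read off the chart. This is a sketch on made-up hours, assuming `hour` holds integers from 0 to 23 as the chart suggests.

```python
import pandas as pd

# Toy hours for illustration only; not taken from the survey
hours = pd.Series([8, 12, 13, 15, 18, 21, 9, 17])

# between() is inclusive on both ends by default, so this covers 12:00 to 18:59
afternoon_share = hours.between(12, 18).mean() * 100
print(afternoon_share)  # 62.5
```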
Without question, the most popular drink of choice is hot coffee.
It is followed by soda (carbonated soft drink), hot tea, bottled water and tap water.
# plot drinks
plt.figure(figsize=(12,10))
sns.set_style('white')
q = df['drink'].value_counts(normalize=True, ascending=False)*100
x = q.values
y = q.index
sns.barplot(x=x, y=y, palette='magma_r', orient='h')
sns.despine(top=True, right=True)
plt.xlabel('% of what people drink', fontsize=14)
plt.ylabel('Drink', fontsize=14)
plt.xticks(fontsize=13, rotation=90)
plt.yticks(fontsize=13)
plt.title('What do people drink?', fontsize=14);
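# cluster_id was converted to str above, so compare against strings
all_drinks[all_drinks['cluster_id']=='1']
d = {}
for i in range(1,10):
    by_cluster = all_drinks[all_drinks['cluster_id']==str(i)]
    # drink -> % share within the cluster (only drinks above the 5% filter)
    d[i] = by_cluster.set_index('drink')['count'].to_dict()
d[1].values()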
from wordcloud import WordCloud
wordcloud = WordCloud(max_font_size=50, background_color="white")
for i in range(1, 10):
    wordcloud.generate_from_frequencies(frequencies=d[i])
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.title('Cluster: {}'.format(i), fontsize=14)
    plt.show()